Credit Card Users Churn Prediction
Problem Statement
Business Context
Thera Bank recently saw a steep decline in the number of users of its credit card. Credit cards are a good source of income for banks because of the different kinds of fees they charge, such as annual fees, balance transfer fees, cash advance fees, late payment fees, foreign transaction fees, and others. Some fees are charged to every user irrespective of usage, while others are charged under specified circumstances.
Customers leaving the credit card service would mean a loss for the bank, so the bank wants to analyze its customer data, identify the customers who are likely to leave, and understand their reasons for leaving, so that it can improve in those areas.
As a data scientist at Thera Bank, you need to come up with a classification model that will help the bank improve its services so that customers do not renounce their credit cards.
Data Description
- CLIENTNUM: Client number. Unique identifier for the customer holding the account
- Attrition_Flag: Internal event (customer activity) variable - if the account is closed then "Attrited Customer" else "Existing Customer"
- Customer_Age: Age in Years
- Gender: Gender of the account holder
- Dependent_count: Number of dependents
- Education_Level: Educational Qualification of the account holder - Graduate, High School, Unknown, Uneducated, College(refers to college student), Post-Graduate, Doctorate
- Marital_Status: Marital Status of the account holder
- Income_Category: Annual Income Category of the account holder
- Card_Category: Type of Card
- Months_on_book: Period of relationship with the bank (in months)
- Total_Relationship_Count: Total no. of products held by the customer
- Months_Inactive_12_mon: No. of months inactive in the last 12 months
- Contacts_Count_12_mon: No. of Contacts in the last 12 months
- Credit_Limit: Credit Limit on the Credit Card
- Total_Revolving_Bal: Total Revolving Balance on the Credit Card
- Avg_Open_To_Buy: Open to Buy Credit Line (Average of last 12 months)
- Total_Amt_Chng_Q4_Q1: Change in Transaction Amount (Q4 over Q1)
- Total_Trans_Amt: Total Transaction Amount (Last 12 months)
- Total_Trans_Ct: Total Transaction Count (Last 12 months)
- Total_Ct_Chng_Q4_Q1: Change in Transaction Count (Q4 over Q1)
- Avg_Utilization_Ratio: Average Card Utilization Ratio
What is a Revolving Balance?
- If we don't pay the balance of the revolving credit account in full every month, the unpaid portion carries over to the next month. That's called a revolving balance.
What is the Average Open to Buy?
- 'Open to Buy' is the amount of credit still available to use on the card. This column represents the average of this value over the last 12 months.
What is the Average Utilization Ratio?
- Avg_Utilization_Ratio represents how much of the available credit the customer has used. This is useful for calculating credit scores.
Relation between Avg_Open_To_Buy, Credit_Limit and Avg_Utilization_Ratio:
- ( Avg_Open_To_Buy / Credit_Limit ) + Avg_Utilization_Ratio = 1
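The relation above can be checked numerically. The sketch below uses the five rows that `head()` displays further down in this notebook; the tolerance of 0.001 is an assumption that allows for the ratio column being shown rounded to three decimals.

```python
# (Credit_Limit, Avg_Open_To_Buy, Avg_Utilization_Ratio) from the first five rows of the dataset
rows = [
    (12691.0, 11914.0, 0.061),
    (8256.0, 7392.0, 0.105),
    (3418.0, 3418.0, 0.000),
    (3313.0, 796.0, 0.760),
    (4716.0, 4716.0, 0.000),
]

TOLERANCE = 1e-3  # assumed: the ratio is displayed rounded to 3 decimals

# Left-hand side of the relation for every row
totals = [open_to_buy / limit + utilization for limit, open_to_buy, utilization in rows]

# The relation holds if every total is within the tolerance of 1
relation_holds = all(abs(total - 1.0) <= TOLERANCE for total in totals)
print(totals, relation_holds)
```

On the full dataset the same check could be written column-wise on `CCchurn_df`, for example by comparing `Avg_Open_To_Buy / Credit_Limit + Avg_Utilization_Ratio` against 1 with a small tolerance.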
Please read the instructions carefully before starting the project.
This is a commented Jupyter IPython Notebook file in which all the instructions and tasks to be performed are mentioned.
- Blanks '_______' are provided in the notebook that need to be filled with appropriate code to get the correct result. With every '_______' blank, there is a comment that briefly describes what needs to be filled in the blank space.
- Identify the task to be performed correctly, and only then proceed to write the required code.
- Fill in the code wherever asked by commented lines like "# write your code here" or "# complete the code". Running incomplete code may throw an error.
- Please run the code cells sequentially from the beginning to avoid unnecessary errors.
- Add the results/observations (wherever mentioned) derived from the analysis to the presentation and submit it.
Importing necessary libraries
!pip install scikit-learn==1.5.2
# Installing the libraries with the specified version.
# uncomment and run the following line if Google Colab is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.10.1 xgboost==2.0.3 -q --user
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# To tune model, get different metric scores, and split data
from sklearn.metrics import (
f1_score,
accuracy_score,
recall_score,
precision_score,
confusion_matrix,
roc_auc_score,
ConfusionMatrixDisplay,
)
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
# To be used for data scaling and one hot encoding
from sklearn.preprocessing import StandardScaler, MinMaxScaler, OneHotEncoder
# To impute missing values
from sklearn.impute import SimpleImputer
from sklearn import metrics
# To oversample and undersample data
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
# To do hyperparameter tuning
from sklearn.model_selection import RandomizedSearchCV, GridSearchCV
# To define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
# To suppress scientific notation in a dataframe
pd.set_option("display.float_format", lambda x: "%.3f" % x)
# To help with model building
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
AdaBoostClassifier,
GradientBoostingClassifier,
RandomForestClassifier,
BaggingClassifier,
)
!pip install xgboost==1.7.6
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
# To suppress warnings
import warnings
warnings.filterwarnings("ignore")
# Installing the libraries with the specified version.
# uncomment and run the following lines if Jupyter Notebook is being used
# !pip install scikit-learn==1.2.2 seaborn==0.13.1 matplotlib==3.7.1 numpy==1.25.2 pandas==1.5.3 imbalanced-learn==0.12.0 xgboost==2.0.3 -q --user
# !pip install --upgrade -q threadpoolctl
Note: After running the above cell, kindly restart the notebook kernel and run all cells sequentially from the start again.
from google.colab import drive
drive.mount('/content/drive')
Loading the dataset
url = "/content/drive/My Drive/Colab Notebooks/Advanced Machine Learning/Project/BankChurners.csv"
CCchurn_df = pd.read_csv(url)
data_df = CCchurn_df.copy()
Data Overview
- Observations
- Sanity checks
CCchurn_df.head()
| | CLIENTNUM | Attrition_Flag | Customer_Age | Gender | Dependent_count | Education_Level | Marital_Status | Income_Category | Card_Category | Months_on_book | Total_Relationship_Count | Months_Inactive_12_mon | Contacts_Count_12_mon | Credit_Limit | Total_Revolving_Bal | Avg_Open_To_Buy | Total_Amt_Chng_Q4_Q1 | Total_Trans_Amt | Total_Trans_Ct | Total_Ct_Chng_Q4_Q1 | Avg_Utilization_Ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 768805383 | Existing Customer | 45 | M | 3 | High School | Married | $60K - $80K | Blue | 39 | 5 | 1 | 3 | 12691.000 | 777 | 11914.000 | 1.335 | 1144 | 42 | 1.625 | 0.061 |
| 1 | 818770008 | Existing Customer | 49 | F | 5 | Graduate | Single | Less than $40K | Blue | 44 | 6 | 1 | 2 | 8256.000 | 864 | 7392.000 | 1.541 | 1291 | 33 | 3.714 | 0.105 |
| 2 | 713982108 | Existing Customer | 51 | M | 3 | Graduate | Married | $80K - $120K | Blue | 36 | 4 | 1 | 0 | 3418.000 | 0 | 3418.000 | 2.594 | 1887 | 20 | 2.333 | 0.000 |
| 3 | 769911858 | Existing Customer | 40 | F | 4 | High School | NaN | Less than $40K | Blue | 34 | 3 | 4 | 1 | 3313.000 | 2517 | 796.000 | 1.405 | 1171 | 20 | 2.333 | 0.760 |
| 4 | 709106358 | Existing Customer | 40 | M | 3 | Uneducated | Married | $60K - $80K | Blue | 21 | 5 | 1 | 0 | 4716.000 | 0 | 4716.000 | 2.175 | 816 | 28 | 2.500 | 0.000 |
CCchurn_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10127 entries, 0 to 10126
Data columns (total 21 columns):
 #   Column                    Non-Null Count  Dtype
---  ------                    --------------  -----
 0   CLIENTNUM                 10127 non-null  int64
 1   Attrition_Flag            10127 non-null  object
 2   Customer_Age              10127 non-null  int64
 3   Gender                    10127 non-null  object
 4   Dependent_count           10127 non-null  int64
 5   Education_Level           8608 non-null   object
 6   Marital_Status            9378 non-null   object
 7   Income_Category           10127 non-null  object
 8   Card_Category             10127 non-null  object
 9   Months_on_book            10127 non-null  int64
 10  Total_Relationship_Count  10127 non-null  int64
 11  Months_Inactive_12_mon    10127 non-null  int64
 12  Contacts_Count_12_mon     10127 non-null  int64
 13  Credit_Limit              10127 non-null  float64
 14  Total_Revolving_Bal       10127 non-null  int64
 15  Avg_Open_To_Buy           10127 non-null  float64
 16  Total_Amt_Chng_Q4_Q1      10127 non-null  float64
 17  Total_Trans_Amt           10127 non-null  int64
 18  Total_Trans_Ct            10127 non-null  int64
 19  Total_Ct_Chng_Q4_Q1       10127 non-null  float64
 20  Avg_Utilization_Ratio     10127 non-null  float64
dtypes: float64(5), int64(10), object(6)
memory usage: 1.6+ MB
Observations:
- There are 6 object-type columns; the remaining 15 columns are numerical.
- CLIENTNUM: identifier only, ignored in the analysis.
- Categorical variables: Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, Card_Category
- Numerical variables: Customer_Age, Dependent_count, Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio
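The split into categorical and numerical variables does not have to be typed by hand. A minimal sketch, assuming a small stand-in frame (`sample_df` is hypothetical and holds only a few of the dataset's columns), derives both lists from the column dtypes:

```python
import pandas as pd

# Tiny stand-in frame with a few of the dataset's columns (values from the first two rows)
sample_df = pd.DataFrame(
    {
        "CLIENTNUM": [768805383, 818770008],
        "Gender": ["M", "F"],
        "Card_Category": ["Blue", "Blue"],
        "Customer_Age": [45, 49],
        "Credit_Limit": [12691.0, 8256.0],
    }
)

# Non-numeric columns are treated as categorical
categorical_cols = sample_df.select_dtypes(exclude="number").columns.tolist()

# Every numeric column except the identifier is treated as a numerical feature
numerical_cols = (
    sample_df.select_dtypes(include="number").columns.drop("CLIENTNUM").tolist()
)

print(categorical_cols)
print(numerical_cols)
```

The same two calls on `CCchurn_df` should reproduce the lists above, as long as they are run before the target column is mapped to 0 and 1.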
CCchurn_df.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| CLIENTNUM | 10127.000 | 739177606.334 | 36903783.450 | 708082083.000 | 713036770.500 | 717926358.000 | 773143533.000 | 828343083.000 |
| Customer_Age | 10127.000 | 46.326 | 8.017 | 26.000 | 41.000 | 46.000 | 52.000 | 73.000 |
| Dependent_count | 10127.000 | 2.346 | 1.299 | 0.000 | 1.000 | 2.000 | 3.000 | 5.000 |
| Months_on_book | 10127.000 | 35.928 | 7.986 | 13.000 | 31.000 | 36.000 | 40.000 | 56.000 |
| Total_Relationship_Count | 10127.000 | 3.813 | 1.554 | 1.000 | 3.000 | 4.000 | 5.000 | 6.000 |
| Months_Inactive_12_mon | 10127.000 | 2.341 | 1.011 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Contacts_Count_12_mon | 10127.000 | 2.455 | 1.106 | 0.000 | 2.000 | 2.000 | 3.000 | 6.000 |
| Credit_Limit | 10127.000 | 8631.954 | 9088.777 | 1438.300 | 2555.000 | 4549.000 | 11067.500 | 34516.000 |
| Total_Revolving_Bal | 10127.000 | 1162.814 | 814.987 | 0.000 | 359.000 | 1276.000 | 1784.000 | 2517.000 |
| Avg_Open_To_Buy | 10127.000 | 7469.140 | 9090.685 | 3.000 | 1324.500 | 3474.000 | 9859.000 | 34516.000 |
| Total_Amt_Chng_Q4_Q1 | 10127.000 | 0.760 | 0.219 | 0.000 | 0.631 | 0.736 | 0.859 | 3.397 |
| Total_Trans_Amt | 10127.000 | 4404.086 | 3397.129 | 510.000 | 2155.500 | 3899.000 | 4741.000 | 18484.000 |
| Total_Trans_Ct | 10127.000 | 64.859 | 23.473 | 10.000 | 45.000 | 67.000 | 81.000 | 139.000 |
| Total_Ct_Chng_Q4_Q1 | 10127.000 | 0.712 | 0.238 | 0.000 | 0.582 | 0.702 | 0.818 | 3.714 |
| Avg_Utilization_Ratio | 10127.000 | 0.275 | 0.276 | 0.000 | 0.023 | 0.176 | 0.503 | 0.999 |
Observations:
- CLIENTNUM - unique identifier and can be ignored.
- Customer_Age - Mean age is around 46 years. Min age is 26 years and Max age is 73 years. Median is 46 years
- Dependent_count - Mean is around 2. Min is 0 and max is 5. Median is 2
- Months_on_book - Mean and Median are both around 36 months. Min is 13 and max is 56 months.
- Total_Relationship_Count - Mean is 3.8 and Median is 4. Min is 1 and Max is 6
- Months_Inactive_12_mon - Mean is 2.3 and Median is 2. Min is 0 and Max is 6
- Contacts_Count_12_mon - Mean is around 2.5 and Median is 2. Min is 0 and max is 6
- Credit_Limit - Mean is 8632 dollars and Median is 4549 dollars. Min is 1438 and max is 34516 dollars.
- Total_Revolving_Bal - Mean is 1163 dollars and Median is 1276 dollars. Min is 0 and max is 2517 dollars.
- Avg_Open_To_Buy - Mean is 7469 dollars and Median is 3474 dollars. Min is 3 and max is 34516 dollars
- Total_Amt_Chng_Q4_Q1 - Mean is 0.76 and Median is 0.74. Min is 0 and max is 3.4
- Total_Trans_Amt - Mean is 4404 dollars and Median is 3899 dollars. Min is 510 dollars and max is 18484 dollars.
- Total_Trans_Ct - Mean is 64.9 and Median is 67. Min is 10 and max is 139
- Total_Ct_Chng_Q4_Q1 - Mean is 0.71 and Median is 0.70. Min is 0 and max is 3.7
- Avg_Utilization_Ratio - Mean is 0.28 and Median is 0.18. Min is 0 and max is 0.999
CCchurn_df.shape
(10127, 21)
Observations:
- There are 10127 rows in the dataset and 21 columns
CCchurn_df.isnull().sum()
| | 0 |
|---|---|
| CLIENTNUM | 0 |
| Attrition_Flag | 0 |
| Customer_Age | 0 |
| Gender | 0 |
| Dependent_count | 0 |
| Education_Level | 1519 |
| Marital_Status | 749 |
| Income_Category | 0 |
| Card_Category | 0 |
| Months_on_book | 0 |
| Total_Relationship_Count | 0 |
| Months_Inactive_12_mon | 0 |
| Contacts_Count_12_mon | 0 |
| Credit_Limit | 0 |
| Total_Revolving_Bal | 0 |
| Avg_Open_To_Buy | 0 |
| Total_Amt_Chng_Q4_Q1 | 0 |
| Total_Trans_Amt | 0 |
| Total_Trans_Ct | 0 |
| Total_Ct_Chng_Q4_Q1 | 0 |
| Avg_Utilization_Ratio | 0 |
Observations:
- Education_Level and Marital_Status columns have null values.
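To judge how serious the gaps are, the null counts can be turned into percentages. The sketch below only reuses the counts printed above; the variable names are illustrative.

```python
# Row count and null counts as reported by shape and isnull().sum() above
total_rows = 10127
missing_counts = {"Education_Level": 1519, "Marital_Status": 749}

# Share of missing values per column, in percent, rounded to one decimal
missing_pct = {
    column: round(100 * count / total_rows, 1)
    for column, count in missing_counts.items()
}
print(missing_pct)

# On the full dataframe, CCchurn_df.isnull().mean() * 100 gives the same shares for every column
```

Roughly 15% of Education_Level and 7.4% of Marital_Status values are missing, which is too much to simply drop the rows without losing a sizeable part of the data.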
CCchurn_df.duplicated().sum()
0
Observations:
- There are no duplicates.
CCchurn_df['Attrition_Flag'].value_counts(normalize=True)
| Attrition_Flag | proportion |
|---|---|
| Existing Customer | 0.839 |
| Attrited Customer | 0.161 |
Observations:
- 83.9% are existing customers and 16.1% are attrited customers.
- This is an imbalanced dataset.
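Why the imbalance matters can be shown with a naive baseline. The sketch below builds 1,000 illustrative labels in the observed 83.9% / 16.1% proportions (not the real rows) and scores a "model" that always predicts the majority class.

```python
# Illustrative labels matching the observed class proportions (0 = existing, 1 = attrited)
y_true = [0] * 839 + [1] * 161

# A naive "model" that always predicts the majority class
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the attrited class: share of attrited customers that were caught
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
actual_positives = sum(t == 1 for t in y_true)
recall = true_positives / actual_positives

print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
```

The baseline reaches about 84% accuracy while identifying no attrited customers at all, so accuracy alone is a misleading metric here and recall on the attrited class deserves attention.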
#Representing the target variable values as 0 and 1 instead of string values.
CCchurn_df['Attrition_Flag'] = CCchurn_df['Attrition_Flag'].map({'Existing Customer': 0, 'Attrited Customer': 1})
CCchurn_df['Attrition_Flag'].value_counts(normalize=True)
| Attrition_Flag | proportion |
|---|---|
| 0 | 0.839 |
| 1 | 0.161 |
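One caveat with `Series.map` and a dictionary: any value that is not a key of the dictionary becomes NaN, so re-running the mapping cell on the already encoded column would turn the whole target into NaN. A minimal sketch of a guard, using a hypothetical three-row series rather than the real column:

```python
import pandas as pd

# Hypothetical stand-in for the Attrition_Flag column
labels = pd.Series(["Existing Customer", "Attrited Customer", "Existing Customer"])
mapping = {"Existing Customer": 0, "Attrited Customer": 1}

# Fail early if any label is missing from the mapping (it would silently become NaN)
unexpected = set(labels.unique()) - set(mapping)
assert not unexpected, f"Unmapped labels: {unexpected}"

encoded = labels.map(mapping)
print(encoded.tolist())
```

The unchanged proportions in the table above (0.839 and 0.161) confirm that the mapping in this notebook covered every row.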
Exploratory Data Analysis (EDA)
- EDA is an important part of any project involving data.
- It is important to investigate and understand the data better before building a model with it.
- A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
- A thorough analysis of the data, in addition to the questions mentioned below, should be done.
Questions:
- How is the total transaction amount distributed?
- What is the distribution of the level of education of customers?
- What is the distribution of the level of income of customers?
- How does the change in transaction count between Q4 and Q1 (Total_Ct_Chng_Q4_Q1) vary by the customer's account status (Attrition_Flag)?
- How does the number of months a customer was inactive in the last 12 months (Months_Inactive_12_mon) vary by the customer's account status (Attrition_Flag)?
- What are the attributes that have a strong correlation with each other?
Univariate Analysis
Categorical Variables (barplot on each of the variables below):
Attrition_Flag, Gender, Education_Level, Marital_Status, Income_Category, Card_Category
Numerical Variables (histplot and boxplot on each of the variables below):
CLIENTNUM - Ignore
Customer_Age, Dependent_count, Months_on_book, Total_Relationship_Count, Months_Inactive_12_mon, Contacts_Count_12_mon, Credit_Limit, Total_Revolving_Bal, Avg_Open_To_Buy, Total_Amt_Chng_Q4_Q1, Total_Trans_Amt, Total_Trans_Ct, Total_Ct_Chng_Q4_Q1, Avg_Utilization_Ratio
The functions below need to be defined to carry out the Exploratory Data Analysis.
# function to plot a boxplot and a histogram along the same scale.
def histogram_boxplot(data, feature, figsize=(12, 7), kde=False, bins=None):
    """
    Boxplot and histogram combined
    data: dataframe
    feature: dataframe column
    figsize: size of figure (default (12,7))
    kde: whether to show the density curve (default False)
    bins: number of bins for histogram (default None)
    """
    f2, (ax_box2, ax_hist2) = plt.subplots(
        nrows=2,  # Number of rows of the subplot grid= 2
        sharex=True,  # x-axis will be shared among all subplots
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )  # creating the 2 subplots
    sns.boxplot(
        data=data, x=feature, ax=ax_box2, showmeans=True, color="violet"
    )  # boxplot will be created and a triangle will indicate the mean value of the column
    sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2, bins=bins, palette="winter"
    ) if bins else sns.histplot(
        data=data, x=feature, kde=kde, ax=ax_hist2
    )  # For histogram
    ax_hist2.axvline(
        data[feature].mean(), color="green", linestyle="--"
    )  # Add mean to the histogram
    ax_hist2.axvline(
        data[feature].median(), color="black", linestyle="-"
    )  # Add median to the histogram
Observation on Customer_Age
histogram_boxplot(CCchurn_df, 'Customer_Age')
Observations:
- Customer_Age is approximately symmetric rather than skewed; half of the customers are between 41 and 52 years old.
- There are a few outliers.
- The mean and median ages are very close, at around 46.
Observation on Dependent_count
histogram_boxplot(CCchurn_df, 'Dependent_count')
Observations:
- Dependent_count is roughly symmetric around 2 dependents.
- There are no outliers.
Observation on Months_on_book
histogram_boxplot(CCchurn_df, 'Months_on_book')
Observations:
- Months_on_book is roughly symmetric.
- The number of customers peaks at around 36 months.
- There are outliers.
Observation on Total_Relationship_Count
histogram_boxplot(CCchurn_df, 'Total_Relationship_Count')
Observations:
- There are no outliers.
- The most common Total_Relationship_Count value is 3.
Observation on Months_Inactive_12_mon
histogram_boxplot(CCchurn_df, 'Months_Inactive_12_mon')
Observations:
- Months_Inactive_12_mon is right skewed.
- There are outliers.
- The largest group of customers was inactive for 3 months in the last 12 months.
Observation on Contacts_Count_12_mon
histogram_boxplot(CCchurn_df, 'Contacts_Count_12_mon')
Observations:
- Contacts_Count_12_mon is right skewed.
- There are outliers.
Observation on Credit_Limit
histogram_boxplot(CCchurn_df, 'Credit_Limit')
Observations:
- Credit limit is right skewed.
- There are outliers.
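The points a boxplot draws as outliers follow the usual 1.5 × IQR rule, and a mean above the median is a quick hint of right skew. A minimal sketch of both checks, assuming a small made-up series (`values` is illustrative, not taken from the dataset):

```python
import pandas as pd

# Made-up values with one extreme observation
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Whisker limits used by the default boxplot rule
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Observations outside the whiskers are flagged as outliers
outliers = values[(values < lower) | (values > upper)]

# A mean above the median suggests a longer right tail
right_skew_hint = values.mean() > values.median()

print(outliers.tolist(), right_skew_hint)
```

Applied to a column such as `CCchurn_df['Credit_Limit']`, the same steps give an outlier count to go with the visual impression from the plot.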
Observation on Total_Revolving_Bal
histogram_boxplot(CCchurn_df, 'Total_Revolving_Bal')
Observations:
- Total Revolving Bal is left skewed.
- There are no outliers.
Observation on Avg_Open_To_Buy
histogram_boxplot(CCchurn_df, 'Avg_Open_To_Buy')
Observations:
- Avg_Open_To_Buy is right skewed.
- There are outliers.
Observation on Total_Amt_Chng_Q4_Q1
histogram_boxplot(CCchurn_df, 'Total_Amt_Chng_Q4_Q1')
Observations:
- It is a right-skewed distribution.
- There are outliers.
Observation on Total_Trans_Amt
histogram_boxplot(CCchurn_df, 'Total_Trans_Amt')
Observations:
- It is a right-skewed distribution.
- There are outliers.
Question 1: How is the total transaction amount distributed?
Answer:
- It is a right-skewed distribution.
- A small number of customers have a significantly higher Total_Trans_Amt than the rest.
- Most customers have a lower total transaction amount (below 5000 dollars).
- There are some outliers, as shown in the boxplot; these are the customers with significantly higher transaction amounts.
Observation on Total_Trans_Ct
histogram_boxplot(CCchurn_df, 'Total_Trans_Ct')
Observations:
- There are a few outliers.
- It is a slightly left-skewed distribution.
Observation on Total_Ct_Chng_Q4_Q1
histogram_boxplot(CCchurn_df, 'Total_Ct_Chng_Q4_Q1')
Observations:
- It is a right-skewed distribution.
- There are outliers.
Observation on Avg_Utilization_Ratio
histogram_boxplot(CCchurn_df, 'Avg_Utilization_Ratio')
Observations:
- It is a right-skewed distribution.
- There are no outliers.
# function to create labeled barplots
def labeled_barplot(data, feature, perc=False, n=None):
    """
    Barplot with percentage at the top
    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """
    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))
    plt.xticks(rotation=90, fontsize=15)
    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )
    for p in ax.patches:
        if perc:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category
        x = p.get_x() + p.get_width() / 2  # horizontal center of the bar
        y = p.get_height()  # height of the bar
        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the bar with its count or percentage
    plt.show()  # show the plot
Observation on Attrition_Flag
labeled_barplot(CCchurn_df, 'Attrition_Flag')
Observations:
- Existing customers far outnumber attrited customers.
Observation on Gender
labeled_barplot(CCchurn_df, 'Gender')
Observations:
- There are more female customers when compared to male customers.
Observation on Education_Level
labeled_barplot(CCchurn_df, 'Education_Level')
Question 2: What is the distribution of the level of education of customers?
Answer: The above plot shows the distribution of the level of education of customers.
- The largest group of customers holds a graduate degree, followed by high school graduates and uneducated customers.
- The smallest group of customers holds a doctorate.
Observation on Marital_Status
labeled_barplot(CCchurn_df, 'Marital_Status')
Observations:
- Married customers form the largest group.
- Divorced customers form the smallest group.
Observation on Income_Category
labeled_barplot(CCchurn_df, 'Income_Category')
Question 3: What is the distribution of the level of income of customers?
Answer: The above barplot shows the distribution of the level of income of customers.
- The largest group of customers has an income of less than $40K, followed by customers with an income of $40K - $60K.
- The smallest group (727 customers) has a high income of $120K+.
- There is an invalid category "abc", which will be treated appropriately in the model pre-processing section.
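The notebook does not say at this point how "abc" will be treated. One common option, shown here purely as an assumption, is to convert the placeholder to a missing value so that it can be imputed together with the other nulls. The series below is a hypothetical stand-in for the Income_Category column.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Income_Category column
income = pd.Series(
    ["Less than $40K", "abc", "$60K - $80K", "abc"], name="Income_Category"
)

# Assumed treatment: turn the invalid placeholder into a proper missing value
cleaned = income.replace("abc", np.nan)

print(cleaned.isna().sum())
```

After this step the former "abc" rows show up in `isnull().sum()` and can be handled by whichever imputation strategy is chosen for the categorical columns.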
Observation on Card_Category
labeled_barplot(CCchurn_df, 'Card_Category')
Observations:
- The largest number of customers hold a Blue card.
- The smallest number of customers hold a Platinum card.
Bivariate Analysis
sns.pairplot(data=CCchurn_df,hue='Attrition_Flag')
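A pair plot shows relationships visually; a correlation matrix puts numbers on them. The sketch below is illustrative only: it uses three columns from the five rows shown by `head()`, which is far too little data for real conclusions, and the 0.9 threshold is an arbitrary assumption.

```python
import pandas as pd

# Three numeric columns from the first five rows of the dataset
sample_df = pd.DataFrame(
    {
        "Credit_Limit": [12691.0, 8256.0, 3418.0, 3313.0, 4716.0],
        "Avg_Open_To_Buy": [11914.0, 7392.0, 3418.0, 796.0, 4716.0],
        "Total_Trans_Ct": [42, 33, 20, 20, 28],
    }
)

# Pearson correlation between every pair of columns
corr = sample_df.corr()

THRESHOLD = 0.9  # assumed cut-off for calling a correlation "strong"

# Collect each unordered pair once, keeping only the strongly correlated ones
strong_pairs = [
    (a, b, round(corr.loc[a, b], 3))
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if abs(corr.loc[a, b]) >= THRESHOLD
]
print(strong_pairs)
```

On the full dataset, `CCchurn_df.corr(numeric_only=True)` (the `numeric_only` argument exists in pandas 1.5 and later) passed to `sns.heatmap` gives the same information for all numerical columns at once.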